{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Naive Bayes - Trabalho\n",
    "\n",
    "## Questão 1\n",
    "\n",
    "Implemente um classifacor Naive Bayes para o problema de predizer a qualidade de um carro. Para este fim, utilizaremos um conjunto de dados referente a qualidade de carros, disponível no [UCI](https://archive.ics.uci.edu/ml/datasets/car+evaluation). Este dataset de carros possui as seguintes features e classe:\n",
    "\n",
    "** Attributos **\n",
    "1. buying: vhigh, high, med, low\n",
    "2. maint: vhigh, high, med, low\n",
    "3. doors: 2, 3, 4, 5, more\n",
    "4. persons: 2, 4, more\n",
    "5. lug_boot: small, med, big\n",
    "6. safety: low, med, high\n",
    "\n",
    "** Classes **\n",
    "1. unacc, acc, good, vgood\n",
    "\n",
    "## Questão 2\n",
    "Crie uma versão de sua implementação usando as funções disponíveis na biblioteca SciKitLearn para o Naive Bayes ([veja aqui](http://scikit-learn.org/stable/modules/naive_bayes.html)) \n",
    "\n",
    "## Questão 3\n",
    "\n",
    "Analise a acurácia dos dois algoritmos e discuta a sua solução."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Questão 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 533,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#Bibliotecas\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split \n",
    "from sklearn.metrics import accuracy_score, classification_report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 534,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>buying</th>\n",
       "      <th>maint</th>\n",
       "      <th>doors</th>\n",
       "      <th>persons</th>\n",
       "      <th>lug_boot</th>\n",
       "      <th>safety</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>vhigh</td>\n",
       "      <td>vhigh</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>small</td>\n",
       "      <td>low</td>\n",
       "      <td>unacc</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>vhigh</td>\n",
       "      <td>vhigh</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>small</td>\n",
       "      <td>med</td>\n",
       "      <td>unacc</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>vhigh</td>\n",
       "      <td>vhigh</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>small</td>\n",
       "      <td>high</td>\n",
       "      <td>unacc</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>vhigh</td>\n",
       "      <td>vhigh</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>med</td>\n",
       "      <td>low</td>\n",
       "      <td>unacc</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>vhigh</td>\n",
       "      <td>vhigh</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>med</td>\n",
       "      <td>med</td>\n",
       "      <td>unacc</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  buying  maint doors persons lug_boot safety  class\n",
       "0  vhigh  vhigh     2       2    small    low  unacc\n",
       "1  vhigh  vhigh     2       2    small    med  unacc\n",
       "2  vhigh  vhigh     2       2    small   high  unacc\n",
       "3  vhigh  vhigh     2       2      med    low  unacc\n",
       "4  vhigh  vhigh     2       2      med    med  unacc"
      ]
     },
     "execution_count": 534,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Lendo o arquivo com os dados\n",
    "df = pd.read_csv('carData.csv', header=None)\n",
    "df.columns = ['buying', 'maint','doors','persons','lug_boot','safety','class']\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 535,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Separa em conjunto de treino (70%) e teste (30%)\n",
    "df_train, df_test = train_test_split(df,test_size=0.3)\n",
    "\n",
    "#Faz uma copia dos mesmos dados para usar na questão 2\n",
    "df_train_sci = df_train.copy()\n",
    "df_test_sci = df_test.copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 536,
   "metadata": {},
   "outputs": [],
   "source": [
    "#separa conjunto de teste em features e classe e calcula o tamanho total de treino\n",
    "df = df_train\n",
    "totalSize = len(df)\n",
    "df_test_X = df_test.iloc[:,0:-1]\n",
    "df_test_y = df_test['class']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 537,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#separa os dados por classe\n",
    "for class_value in df['class'].unique():\n",
    "    class_set = df[df['class'] == class_value]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 538,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[[246, 177, 187, 241],\n",
       "  [189, 246, 213, 203],\n",
       "  [236, 216, 201, 198],\n",
       "  [221, 407, 223],\n",
       "  [326, 277, 248],\n",
       "  [407, 249, 195]],\n",
       " [[51, 86, 60, 61],\n",
       "  [80, 52, 69, 57],\n",
       "  [55, 66, 71, 66],\n",
       "  [125, 0, 133],\n",
       "  [76, 86, 96],\n",
       "  [0, 121, 137]],\n",
       " [[0, 18, 31, 0],\n",
       "  [19, 0, 10, 20],\n",
       "  [4, 10, 17, 18],\n",
       "  [23, 0, 26],\n",
       "  [0, 19, 30],\n",
       "  [0, 0, 49]],\n",
       " [[0, 21, 30, 0],\n",
       "  [15, 0, 0, 36],\n",
       "  [11, 13, 14, 13],\n",
       "  [27, 0, 24],\n",
       "  [14, 18, 19],\n",
       "  [0, 30, 21]]]"
      ]
     },
     "execution_count": 538,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Calcula a frequencia de cada atributo de cada feature em cada classe\n",
    "#frequency = [[[[k,len(df[df['class'] == i].index & df[df[j] == k].index)]  for k in df[j].unique()] for j in df.columns[:-1]] for i in df['class'].unique()]\n",
    "frequency = [[[len(df[df['class'] == i].index & df[df[j] == k].index)  for k in df[j].unique()] for j in df.columns[:-1]] for i in df['class'].unique()]\n",
    "frequency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 539,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "frequency_test = [[[k  for k in df[j].unique()] for j in df.columns[:-1]] for i in df['class'].unique()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 540,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>buying</th>\n",
       "      <td>vhigh</td>\n",
       "      <td>med</td>\n",
       "      <td>low</td>\n",
       "      <td>high</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>maint</th>\n",
       "      <td>med</td>\n",
       "      <td>vhigh</td>\n",
       "      <td>high</td>\n",
       "      <td>low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>doors</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>5more</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>persons</th>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>more</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>lug_boot</th>\n",
       "      <td>small</td>\n",
       "      <td>med</td>\n",
       "      <td>big</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>safety</th>\n",
       "      <td>low</td>\n",
       "      <td>med</td>\n",
       "      <td>high</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>class</th>\n",
       "      <td>unacc</td>\n",
       "      <td>acc</td>\n",
       "      <td>vgood</td>\n",
       "      <td>good</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              0      1      2     3\n",
       "buying    vhigh    med    low  high\n",
       "maint       med  vhigh   high   low\n",
       "doors         2      3  5more     4\n",
       "persons       4      2   more  None\n",
       "lug_boot  small    med    big  None\n",
       "safety      low    med   high  None\n",
       "class     unacc    acc  vgood  good"
      ]
     },
     "execution_count": 540,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Monta tabela de indices para enumerar cada atributo\n",
    "columns = ['buying', 'maint','doors','persons','lug_boot','safety']\n",
    "matrixA={}\n",
    "p = 0;\n",
    "for j in columns:\n",
    "    matrixA[j] = frequency_test[0][p]\n",
    "    p += 1\n",
    "matrixA['class'] = list(df['class'].unique())\n",
    "frequency_label = pd.DataFrame.from_dict(matrixA, orient='index')\n",
    "frequency_label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 541,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#(frequency_label.loc['buying'] == 'vhigh').argmax()\n",
    "#sum(frequency[3][0][:] + frequency[2][0][:] + frequency[1][0][:] + frequency[0][0][:])\n",
    "#frequency[3][0][:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 542,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[0.2456575682382134,\n",
       "  0.24979321753515302,\n",
       "  0.2547559966914806,\n",
       "  0.24979321753515302],\n",
       " [0.2506203473945409,\n",
       "  0.24648469809760132,\n",
       "  0.24152191894127378,\n",
       "  0.261373035566584],\n",
       " [0.2531017369727047,\n",
       "  0.2522746071133168,\n",
       "  0.2506203473945409,\n",
       "  0.24400330851943755],\n",
       " [0.32754342431761785, 0.33664185277088504, 0.3358147229114971],\n",
       " [0.34408602150537637, 0.3308519437551696, 0.3250620347394541],\n",
       " [0.33664185277088504, 0.3308519437551696, 0.3325062034739454]]"
      ]
     },
     "execution_count": 542,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Calcula a probabilidade de cada feature\n",
    "likelihood_feature = [[(frequency[0][index][((frequency_label.loc[j] == k).argmax())] + frequency[1][index][((frequency_label.loc[j] == k).argmax())] + frequency[2][index][((frequency_label.loc[j] == k).argmax())] + frequency[3][index][((frequency_label.loc[j] == k).argmax())])/totalSize for k in df[j].unique()] for index,j in enumerate(df.columns[:-1])]\n",
    "likelihood_feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 543,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.70388751033912322,\n",
       " 0.21339950372208435,\n",
       " 0.040529363110008272,\n",
       " 0.042183622828784122]"
      ]
     },
     "execution_count": 543,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Calcula a probabilidade de cada classe\n",
    "likelihood_class = [(sum(df['class'] == h))/totalSize for h in df['class'].unique()]\n",
    "likelihood_class"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 544,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[[0.28907168037602821,\n",
       "   0.20799059929494712,\n",
       "   0.21974148061104584,\n",
       "   0.28319623971797886],\n",
       "  [0.22209165687426558,\n",
       "   0.28907168037602821,\n",
       "   0.25029377203290248,\n",
       "   0.23854289071680376],\n",
       "  [0.27732079905992951,\n",
       "   0.25381903642773207,\n",
       "   0.23619271445358403,\n",
       "   0.23266745005875442],\n",
       "  [0.25969447708578142, 0.47826086956521741, 0.26204465334900118],\n",
       "  [0.38307873090481787, 0.32549941245593422, 0.29142185663924797],\n",
       "  [0.47826086956521741, 0.29259694477085779, 0.2291421856639248]],\n",
       " [[0.19767441860465115,\n",
       "   0.33333333333333331,\n",
       "   0.23255813953488372,\n",
       "   0.23643410852713179],\n",
       "  [0.31007751937984496,\n",
       "   0.20155038759689922,\n",
       "   0.26744186046511625,\n",
       "   0.22093023255813954],\n",
       "  [0.2131782945736434,\n",
       "   0.2558139534883721,\n",
       "   0.27519379844961239,\n",
       "   0.2558139534883721],\n",
       "  [0.48449612403100772, 0.0, 0.51550387596899228],\n",
       "  [0.29457364341085274, 0.33333333333333331, 0.37209302325581395],\n",
       "  [0.0, 0.4689922480620155, 0.53100775193798455]],\n",
       " [[0.0, 0.36734693877551022, 0.63265306122448983, 0.0],\n",
       "  [0.38775510204081631, 0.0, 0.20408163265306123, 0.40816326530612246],\n",
       "  [0.081632653061224483,\n",
       "   0.20408163265306123,\n",
       "   0.34693877551020408,\n",
       "   0.36734693877551022],\n",
       "  [0.46938775510204084, 0.0, 0.53061224489795922],\n",
       "  [0.0, 0.38775510204081631, 0.61224489795918369],\n",
       "  [0.0, 0.0, 1.0]],\n",
       " [[0.0, 0.41176470588235292, 0.58823529411764708, 0.0],\n",
       "  [0.29411764705882354, 0.0, 0.0, 0.70588235294117652],\n",
       "  [0.21568627450980393,\n",
       "   0.25490196078431371,\n",
       "   0.27450980392156865,\n",
       "   0.25490196078431371],\n",
       "  [0.52941176470588236, 0.0, 0.47058823529411764],\n",
       "  [0.27450980392156865, 0.35294117647058826, 0.37254901960784315],\n",
       "  [0.0, 0.58823529411764708, 0.41176470588235292]]]"
      ]
     },
     "execution_count": 544,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Calcula a probabilidade de cada atributo em cada feature em cada classe\n",
    "frequency_prob = [[[len(df[df['class'] == i].index & df[df[j] == k].index)/sum(df['class'] == i)  for k in df[j].unique()] for j in df.columns[:-1]] for i in df['class'].unique()]\n",
    "frequency_prob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 545,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#data = df_test.iloc[0]\n",
    "#data = data[:-1] #esclui a classe\n",
    "#df_test.iloc[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 546,
   "metadata": {},
   "outputs": [],
   "source": [
    "#feature_index = [(frequency_label.loc[i] == data.loc[i]).argmax() for i in data.index]\n",
    "#feature_index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 547,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Faz a predição do conjunto de teste\n",
    "df_test = df_test_X\n",
    "predicts = [\"\" for x in range(len(df_test))]\n",
    "k = 0;\n",
    "for row in range(len(df_test)):\n",
    "    data = df_test.iloc[row]\n",
    "    feature_index = [(frequency_label.loc[h] == data.loc[h]).argmax() for h in data.index]\n",
    "    \n",
    "    class_chance = np.zeros(len(likelihood_class))\n",
    "    for i,prob_class in enumerate(likelihood_class):\n",
    "        prob = 1;\n",
    "        for j,feature in enumerate(feature_index):\n",
    "            prob *= frequency_prob[i][j][feature]\n",
    "        class_chance[i] = (prob * prob_class) #/ likelihood_feature   \n",
    "    predicts[k] = [frequency_label.loc['class'][class_chance.argmax()]]  \n",
    "    k += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 548,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_y = np.squeeze(np.asarray(predicts))\n",
    "correct_values = np.sum(result_y == df_test_y.values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 549,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Porcentagem de acerto = 0.8651252408477842\n"
     ]
    }
   ],
   "source": [
    "correct_pct = correct_values / len(df_test_y)\n",
    "print('Porcentagem de acerto = {0}'.format(correct_pct))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 550,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Classification Report:\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "      unacc       0.74      0.69      0.72       126\n",
      "        acc       0.73      0.44      0.55        18\n",
      "       good       0.91      0.97      0.94       359\n",
      "      vgood       0.86      0.38      0.52        16\n",
      "\n",
      "avg / total       0.86      0.87      0.86       519\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"\\nClassification Report:\")\n",
    "print(classification_report(y_true=df_test_y, y_pred=result_y, target_names=[\"unacc\", \"acc\", \"good\", \"vgood\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Questão 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 551,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.naive_bayes import MultinomialNB, GaussianNB\n",
    "from sklearn.preprocessing import LabelEncoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 552,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Transforma os valores string em númericos\n",
    "#Na minha versão foi feito o tratamento de todos os valores sem precisar converter, porém\n",
    "#a função do scikit não é adaptado para valores strings\n",
    "for i in range(0, df_train_sci.shape[1]):\n",
    "    df_train_sci.iloc[:,i] = LabelEncoder().fit_transform(df_train_sci.iloc[:,i])\n",
    "    \n",
    "for i in range(0, df_test_sci.shape[1]):\n",
    "    df_test_sci.iloc[:,i] = LabelEncoder().fit_transform(df_test_sci.iloc[:,i])\n",
    "\n",
    "#Separa em features e classes\n",
    "train_X_sci =  df_train_sci.iloc[:,:-1]\n",
    "test_X_sci =  df_test_sci.iloc[:,:-1]\n",
    "train_y_sci = df_train_sci.iloc[:,-1]\n",
    "test_y_sci = df_test_sci.iloc[:,-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 553,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Faz o fit e o predict com a implementação do scikit\n",
    "nb = GaussianNB()\n",
    "nb.fit(train_X_sci, train_y_sci)\n",
    "result_y_sci = nb.predict(test_X_sci)\n",
    "correct_pct_sci = accuracy_score(test_y_sci, result_y_sci)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 554,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Porcentagem de acerto = 0.6088631984585742\n"
     ]
    }
   ],
   "source": [
    "print('Porcentagem de acerto = {0}'.format(correct_pct_sci))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 555,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Classification Report:\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "      unacc       0.48      0.09      0.15       126\n",
      "        acc       0.00      0.00      0.00        18\n",
      "       good       0.84      0.81      0.82       359\n",
      "      vgood       0.11      1.00      0.19        16\n",
      "\n",
      "avg / total       0.70      0.61      0.61       519\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\Carlos\\Miniconda3\\envs\\carlos\\lib\\site-packages\\sklearn\\metrics\\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.\n",
      "  'precision', 'predicted', average, warn_for)\n"
     ]
    }
   ],
   "source": [
    "print(\"\\nClassification Report:\")\n",
    "print(classification_report(y_true=test_y_sci, y_pred=result_y_sci, target_names=[\"unacc\", \"acc\", \"good\", \"vgood\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Questão 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 556,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Resultado no meu algoritmo: 0.8651252408477842\n",
      "Resultado no algoritmo do sklearn: 0.6088631984585742\n"
     ]
    }
   ],
   "source": [
    "print('Resultado no meu algoritmo: {0}'.format(correct_pct))\n",
    "print('Resultado no algoritmo do sklearn: {0}'.format(correct_pct_sci))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Na primeira etapa são somadas as quantidades de ocorrências de cada atributo. Em seguida é criado um mapa de índices para ser possível de identificar os valores de probabilidades armazenados, de acordo com os labels string. Então é calculado a probabilidade de cada feature e posteriormente a probabilidade de cada classe, e a probabilidade de cada atributo que foi calculado na etapa inicial. \n",
    "\n",
    "Com todas as probabilidades a etapa de predição consiste, basicamente, em fazer o seguinte cálculo de probabilidades, já armazenadas.\n",
    "\n",
    "$$ v_{NB} = argmax(P(v_{j}) \\prod_{i} P(A_{i}|v_{j})) $$\n",
    "\n",
    "Algumas observações podem ser feitas sobre resultados. O algoritmo do sklearn apresentou um desempenho abaixo do algoritmo implementado. O resultado dos dois algoritmos para o mesmo conjunto de dados, podem ser observados acima. Isso pode ter acontecido devido a transformação dos dados em numéricos para o algoritmo do sklearn, podendo ter sido realizado alguma medida, em que os dados foram mal ponderados pelos valores assumidos. Já no algoritmo implementado, no qual as probabilidades foram calculadas corretamente, e sem alterar os valores das features, o resultado sempre ficou próximo dos 85%. Considerando que um resultado razoável para esta base é de 70% a 76%, o resultado ficou bem acima do esperado. \n",
    "\n",
    "É possível observar que as classes 'vgood' e 'acc' tiveram uma taxa de acerto baixa, principalmente no algoritmo do sklearn, isso deve ter acontecido devido ao desbalanceamento entre as classes, e pelo algoritmo utilizar apenas dados de probabilidades. Algumas classes podem apresentar valores de características bem representativos e já outras nem tanto. Fazendo com que as predições caiam nas classes bem representadas."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
